• Term Paper Project
  • Winter 2018, DSPA (HS650)
  • Name: Xiayi Li
  • UMich E-mail: xiayi@umich.edu
  • I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.

1 Abstract

The main goal of our project is to predict whether a driver will file a claim in the next year, which helps the insurance company tailor its prices. Our data come from Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies. Because the data involve personal information, the raw data we received are masked, meaning we have no prior information about what the variables actually measure. The only information available is the grouping (features that belong to similar groupings are tagged as such in the feature names) and the type of each variable (binary, categorical, etc.). We first performed data exploration, deleted the variables with a large proportion of missing data, and resampled the data to obtain a more balanced data set. We fit five models in all: Logistic Regression, PCA+C5.0, Random Forest, Neural Networks, and XGBoost. In addition, we used confusion matrices and ROC curves to evaluate and visualize the performance of the five models. It turned out that XGBoost is the best algorithm for this case, with nearly accurate predictions, while the other methods - Logistic Regression, PCA+C5.0, Random Forest, and Neural Networks - do not achieve high accuracy here.

2 Introduction and Background

  • Our project is a Kaggle competition project. The data are from Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies, and all the variables describe past information about the drivers. Inaccuracies in a car insurance company’s claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones. In this project, we want to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year. A more accurate prediction will allow the company to further tailor its prices, and hopefully make auto insurance coverage more accessible to more drivers.

  • Prediction/Hypothesis: Whether a driver will file a claim is associated with the other variables, and it can be predicted by machine learning algorithms trained on the past information.

3 Data Exploration

Load all packages we will use in this project.

library(data.table)
library(tibble)
library(purrr)
library(ggplot2)
library(ROSE)
library(dplyr)
library(magrittr)
library(gridExtra)
library(corrplot)
library(caret)
library(rvest)
library(factoextra)
library(randomForest)
library(pROC)
library(ROCR)
library(C50)
library(drat)
library(xgboost)
library(neuralnet)

Data Overview:

  • Features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc).
  • Feature names include the postfix bin to indicate binary features and cat to indicate categorical features.

  • Features without these designations are either continuous or ordinal.
  • Values of -1 indicate that the feature was missing from the observation.
  • The target column signifies whether or not a claim was filed for that policy holder.
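Because the feature names encode both the group and the type, the columns can be grouped programmatically. A small base-R sketch on a toy vector of column names (the vector `feats` is a hypothetical subset used only for illustration):

```r
# Toy subset of the Porto Seguro column names.
feats <- c("ps_ind_06_bin", "ps_ind_02_cat", "ps_reg_03",
           "ps_car_04_cat", "ps_calc_15_bin", "ps_calc_01")

# The group tag is the second underscore-separated token; the type comes from the postfix.
groups <- sapply(strsplit(feats, "_"), `[`, 2)
types  <- ifelse(grepl("_bin$", feats), "binary",
          ifelse(grepl("_cat$", feats), "categorical", "continuous/ordinal"))

table(groups)        # how many toy features fall in each group
split(feats, types)  # feature names bucketed by type
```

The same `grepl()` patterns are what allow type-specific processing (e.g. imputing `_cat` columns differently) later in the pipeline.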

3.1 Read in Data

We read the data with fread(), which is much faster than read.csv() for a file of this size, and store it as a tibble. We have 595212 observations and 59 variables in total.

# In this data, "-1" indicates NA. 
PS <- as.tibble(fread('train.csv', na.strings=c("-1","-1.0")))
## 
Read 0.0% of 595212 rows
Read 50.4% of 595212 rows
Read 97.4% of 595212 rows
Read 595212 rows and 59 (of 59) columns from 0.108 GB file in 00:00:05
str(PS)
## Classes 'tbl_df', 'tbl' and 'data.frame':    595212 obs. of  59 variables:
##  $ id            : int  7 9 13 16 17 19 20 22 26 28 ...
##  $ target        : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ ps_ind_01     : int  2 1 5 0 0 5 2 5 5 1 ...
##  $ ps_ind_02_cat : int  2 1 4 1 2 1 1 1 1 1 ...
##  $ ps_ind_03     : int  5 7 9 2 0 4 3 4 3 2 ...
##  $ ps_ind_04_cat : int  1 0 1 0 1 0 1 0 1 0 ...
##  $ ps_ind_05_cat : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_06_bin : int  0 0 0 1 1 0 0 1 0 0 ...
##  $ ps_ind_07_bin : int  1 0 0 0 0 0 1 0 0 1 ...
##  $ ps_ind_08_bin : int  0 1 1 0 0 0 0 0 1 0 ...
##  $ ps_ind_09_bin : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ ps_ind_10_bin : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_11_bin : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_12_bin : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_13_bin : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_14     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_15     : int  11 3 12 8 9 6 8 13 6 4 ...
##  $ ps_ind_16_bin : int  0 0 1 1 1 1 1 1 1 0 ...
##  $ ps_ind_17_bin : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_18_bin : int  0 1 0 0 0 0 0 0 0 1 ...
##  $ ps_reg_01     : num  0.7 0.8 0 0.9 0.7 0.9 0.6 0.7 0.9 0.9 ...
##  $ ps_reg_02     : num  0.2 0.4 0 0.2 0.6 1.8 0.1 0.4 0.7 1.4 ...
##  $ ps_reg_03     : num  0.718 0.766 NA 0.581 0.841 ...
##  $ ps_car_01_cat : int  10 11 7 7 11 10 6 11 10 11 ...
##  $ ps_car_02_cat : int  1 1 1 1 1 0 1 1 1 0 ...
##  $ ps_car_03_cat : int  NA NA NA 0 NA NA NA 0 NA 0 ...
##  $ ps_car_04_cat : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ ps_car_05_cat : int  1 NA NA 1 NA 0 1 0 1 0 ...
##  $ ps_car_06_cat : int  4 11 14 11 14 14 11 11 14 14 ...
##  $ ps_car_07_cat : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ ps_car_08_cat : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ ps_car_09_cat : int  0 2 2 3 2 0 0 2 0 2 ...
##  $ ps_car_10_cat : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ ps_car_11_cat : int  12 19 60 104 82 104 99 30 68 104 ...
##  $ ps_car_11     : int  2 3 1 1 3 2 2 3 3 2 ...
##  $ ps_car_12     : num  0.4 0.316 0.316 0.374 0.316 ...
##  $ ps_car_13     : num  0.884 0.619 0.642 0.543 0.566 ...
##  $ ps_car_14     : num  0.371 0.389 0.347 0.295 0.365 ...
##  $ ps_car_15     : num  3.61 2.45 3.32 2 2 ...
##  $ ps_calc_01    : num  0.6 0.3 0.5 0.6 0.4 0.7 0.2 0.1 0.9 0.7 ...
##  $ ps_calc_02    : num  0.5 0.1 0.7 0.9 0.6 0.8 0.6 0.5 0.8 0.8 ...
##  $ ps_calc_03    : num  0.2 0.3 0.1 0.1 0 0.4 0.5 0.1 0.6 0.8 ...
##  $ ps_calc_04    : int  3 2 2 2 2 3 2 1 3 2 ...
##  $ ps_calc_05    : int  1 1 2 4 2 1 2 2 1 2 ...
##  $ ps_calc_06    : int  10 9 9 7 6 8 8 7 7 8 ...
##  $ ps_calc_07    : int  1 5 1 1 3 2 1 1 3 2 ...
##  $ ps_calc_08    : int  10 8 8 8 10 11 8 6 9 9 ...
##  $ ps_calc_09    : int  1 1 2 4 2 3 3 1 4 1 ...
##  $ ps_calc_10    : int  5 7 7 2 12 8 10 13 11 11 ...
##  $ ps_calc_11    : int  9 3 4 2 3 4 3 7 4 3 ...
##  $ ps_calc_12    : int  1 1 2 2 1 2 0 1 2 5 ...
##  $ ps_calc_13    : int  5 1 7 4 1 0 0 3 1 0 ...
##  $ ps_calc_14    : int  8 9 7 9 3 9 10 6 5 6 ...
##  $ ps_calc_15_bin: int  0 0 0 0 0 0 0 1 0 0 ...
##  $ ps_calc_16_bin: int  1 1 1 0 0 1 1 0 1 1 ...
##  $ ps_calc_17_bin: int  1 1 1 0 0 0 0 1 0 0 ...
##  $ ps_calc_18_bin: int  0 0 0 0 1 1 0 0 0 0 ...
##  $ ps_calc_19_bin: int  0 1 1 0 1 1 1 1 0 1 ...
##  $ ps_calc_20_bin: int  1 0 0 0 0 1 0 0 1 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

3.2 Examine Missing Values

dfmi<-data.frame(feature = names(PS), per_miss = map_dbl(PS, function(x) { sum(is.na(x))/length(x) }))
ggplot(data=dfmi,aes(x = feature, y = per_miss)) + 
    geom_bar(stat = 'identity', color = 'white', fill = '#5a64cd') +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
    labs(x = '', y = '% missing', title = 'Missing Values by Feature') + 
    theme(plot.title = element_text(hjust = 0.5)) + 
    scale_y_continuous(labels = scales::percent)

  • Delete “ps_car_03_cat”, “ps_car_05_cat” and “ps_reg_03”, which have a large proportion of missing data.
  • Delete “ps_car_11_cat”, since it has an abnormally large number of levels.
  • Delete “id”.
colnames(PS)
##  [1] "id"             "target"         "ps_ind_01"      "ps_ind_02_cat" 
##  [5] "ps_ind_03"      "ps_ind_04_cat"  "ps_ind_05_cat"  "ps_ind_06_bin" 
##  [9] "ps_ind_07_bin"  "ps_ind_08_bin"  "ps_ind_09_bin"  "ps_ind_10_bin" 
## [13] "ps_ind_11_bin"  "ps_ind_12_bin"  "ps_ind_13_bin"  "ps_ind_14"     
## [17] "ps_ind_15"      "ps_ind_16_bin"  "ps_ind_17_bin"  "ps_ind_18_bin" 
## [21] "ps_reg_01"      "ps_reg_02"      "ps_reg_03"      "ps_car_01_cat" 
## [25] "ps_car_02_cat"  "ps_car_03_cat"  "ps_car_04_cat"  "ps_car_05_cat" 
## [29] "ps_car_06_cat"  "ps_car_07_cat"  "ps_car_08_cat"  "ps_car_09_cat" 
## [33] "ps_car_10_cat"  "ps_car_11_cat"  "ps_car_11"      "ps_car_12"     
## [37] "ps_car_13"      "ps_car_14"      "ps_car_15"      "ps_calc_01"    
## [41] "ps_calc_02"     "ps_calc_03"     "ps_calc_04"     "ps_calc_05"    
## [45] "ps_calc_06"     "ps_calc_07"     "ps_calc_08"     "ps_calc_09"    
## [49] "ps_calc_10"     "ps_calc_11"     "ps_calc_12"     "ps_calc_13"    
## [53] "ps_calc_14"     "ps_calc_15_bin" "ps_calc_16_bin" "ps_calc_17_bin"
## [57] "ps_calc_18_bin" "ps_calc_19_bin" "ps_calc_20_bin"
ps<-as.data.frame(PS[,-c(1,23,26,28,34)])

3.3 Impute missing values

Impute the categorical features by mode and the numeric features by median.

sapply(ps, function(x) sum(is.na(x)))
##         target      ps_ind_01  ps_ind_02_cat      ps_ind_03  ps_ind_04_cat 
##              0              0            216              0             83 
##  ps_ind_05_cat  ps_ind_06_bin  ps_ind_07_bin  ps_ind_08_bin  ps_ind_09_bin 
##           5809              0              0              0              0 
##  ps_ind_10_bin  ps_ind_11_bin  ps_ind_12_bin  ps_ind_13_bin      ps_ind_14 
##              0              0              0              0              0 
##      ps_ind_15  ps_ind_16_bin  ps_ind_17_bin  ps_ind_18_bin      ps_reg_01 
##              0              0              0              0              0 
##      ps_reg_02  ps_car_01_cat  ps_car_02_cat  ps_car_04_cat  ps_car_06_cat 
##              0            107              5              0              0 
##  ps_car_07_cat  ps_car_08_cat  ps_car_09_cat  ps_car_10_cat      ps_car_11 
##          11489              0            569              0              5 
##      ps_car_12      ps_car_13      ps_car_14      ps_car_15     ps_calc_01 
##              1              0          42620              0              0 
##     ps_calc_02     ps_calc_03     ps_calc_04     ps_calc_05     ps_calc_06 
##              0              0              0              0              0 
##     ps_calc_07     ps_calc_08     ps_calc_09     ps_calc_10     ps_calc_11 
##              0              0              0              0              0 
##     ps_calc_12     ps_calc_13     ps_calc_14 ps_calc_15_bin ps_calc_16_bin 
##              0              0              0              0              0 
## ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin 
##              0              0              0              0
# Mode of a vector; ties are broken by taking the first mode. table() drops NAs by default.
# Note: the mode is returned as a character string, which coerces imputed columns to type chr.
mode <- function (x, na.rm = TRUE) {
  xtab <- table(x)
  xmode <- names(which(xtab == max(xtab)))
  if (length(xmode) > 1) xmode <- xmode[1]
  return(xmode)
}
ps$ps_ind_05_cat[is.na(ps$ps_ind_05_cat)] <- mode(ps$ps_ind_05_cat, na.rm = TRUE)
ps$ps_car_07_cat[is.na(ps$ps_car_07_cat)] <-mode(ps$ps_car_07_cat, na.rm = TRUE)
ps$ps_car_14[is.na(ps$ps_car_14)] <- median(ps$ps_car_14, na.rm = TRUE)
ps$ps_ind_02_cat[is.na(ps$ps_ind_02_cat)] <- mode(ps$ps_ind_02_cat, na.rm = TRUE)
ps$ps_car_01_cat[is.na(ps$ps_car_01_cat)] <- mode(ps$ps_car_01_cat, na.rm = TRUE)
ps$ps_ind_04_cat[is.na(ps$ps_ind_04_cat)] <- mode(ps$ps_ind_04_cat, na.rm = TRUE)
ps$ps_car_02_cat[is.na(ps$ps_car_02_cat)] <- mode(ps$ps_car_02_cat, na.rm = TRUE)
ps$ps_car_11[is.na(ps$ps_car_11)] <- median(ps$ps_car_11, na.rm = TRUE)
ps$ps_car_12[is.na(ps$ps_car_12)] <- median(ps$ps_car_12, na.rm = TRUE)
ps$ps_car_09_cat[is.na(ps$ps_car_09_cat)] <- mode(ps$ps_car_09_cat, na.rm = TRUE)
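The per-column assignments above can also be written as a single loop over all columns that contain missing values, imputing `_cat` columns by mode and the rest by median. A minimal base-R sketch on a hypothetical toy data frame `df` (note that converting the mode back to numeric avoids the chr coercion mentioned later):

```r
# Toy data frame with NAs; *_cat columns imputed by mode, others by median.
df <- data.frame(ps_x_01     = c(1, 2, NA, 2, 5),
                 ps_x_02_cat = c(3, 3, 1, NA, 3))

for (col in names(df)[colSums(is.na(df)) > 0]) {
  if (grepl("_cat$", col)) {
    tab <- table(df[[col]])
    df[[col]][is.na(df[[col]])] <- as.numeric(names(tab)[which.max(tab)])  # mode
  } else {
    df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)         # median
  }
}
df  # no NAs remain
```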
table(is.na(ps))
## 
##    FALSE 
## 32141448
sapply(ps, function(x) sum(is.na(x)))
##         target      ps_ind_01  ps_ind_02_cat      ps_ind_03  ps_ind_04_cat 
##              0              0              0              0              0 
##  ps_ind_05_cat  ps_ind_06_bin  ps_ind_07_bin  ps_ind_08_bin  ps_ind_09_bin 
##              0              0              0              0              0 
##  ps_ind_10_bin  ps_ind_11_bin  ps_ind_12_bin  ps_ind_13_bin      ps_ind_14 
##              0              0              0              0              0 
##      ps_ind_15  ps_ind_16_bin  ps_ind_17_bin  ps_ind_18_bin      ps_reg_01 
##              0              0              0              0              0 
##      ps_reg_02  ps_car_01_cat  ps_car_02_cat  ps_car_04_cat  ps_car_06_cat 
##              0              0              0              0              0 
##  ps_car_07_cat  ps_car_08_cat  ps_car_09_cat  ps_car_10_cat      ps_car_11 
##              0              0              0              0              0 
##      ps_car_12      ps_car_13      ps_car_14      ps_car_15     ps_calc_01 
##              0              0              0              0              0 
##     ps_calc_02     ps_calc_03     ps_calc_04     ps_calc_05     ps_calc_06 
##              0              0              0              0              0 
##     ps_calc_07     ps_calc_08     ps_calc_09     ps_calc_10     ps_calc_11 
##              0              0              0              0              0 
##     ps_calc_12     ps_calc_13     ps_calc_14 ps_calc_15_bin ps_calc_16_bin 
##              0              0              0              0              0 
## ps_calc_17_bin ps_calc_18_bin ps_calc_19_bin ps_calc_20_bin 
##              0              0              0              0

3.4 Balance Sampling

The target column, which we want to predict, is imbalanced, so we use ovun.sample() to obtain a balanced sample. It is necessary to balance the data before applying a machine learning algorithm; otherwise the algorithm becomes biased toward the majority class and fails to learn the minority class.

We also compared the performance of models trained on the original data and on the balanced sample; the latter performed better.
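For intuition, ovun.sample() with method = "both" combines over-sampling of the minority class (with replacement) and under-sampling of the majority class, targeting a sample of size N with minority share p. A simplified base-R sketch of that idea on a toy target vector (the exact ROSE mechanism may differ in details, e.g. by randomizing the class counts):

```r
set.seed(1)
target <- rep(c(0, 1), c(965, 35))   # roughly the 96.4% / 3.6% split in our data
N <- 200; p <- 0.3                   # desired sample size and minority share

min_idx <- which(target == 1)
maj_idx <- which(target == 0)
idx <- c(sample(min_idx, size = round(p * N), replace = TRUE),       # over-sample minority
         sample(maj_idx, size = N - round(p * N), replace = FALSE))  # under-sample majority

table(target[idx]) / N  # roughly 0.7 / 0.3 by construction
```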

3.4.1 The Original Distribution of target

Check the original distribution of the target. Only about 3.6% of observations have target = 1, so the data are highly imbalanced.

ggplot(data = PS, aes(x = as.factor(target))) + 
    geom_bar(fill = "lightblue") + 
    labs(title = 'Distribution of Target Class (1 = claim filed)',x='target') +
    theme(plot.title = element_text(hjust = 0.5))

table(ps$target)/nrow(ps)
## 
##          0          1 
## 0.96355248 0.03644752

3.4.2 Get balanced sample

The running speeds of the different models vary considerably, so we chose the largest sample size that was feasible given the running time on our own computers.

We rebalance our data to roughly 70% with target = 0 and 30% with target = 1.

pddata0 <- ovun.sample(target~.,data=ps,method = "both", p = .3, N = 2000, seed = 1)$data
pddata <- as.data.frame(pddata0)
table(pddata0$target)/nrow(pddata0)
## 
##      0      1 
## 0.6975 0.3025
pddata0 <- pddata0 %>%
   mutate_at(vars(ends_with("cat")), funs(factor)) %>%
   mutate_at(vars(ends_with("cat")), funs(as.numeric))
str(pddata0)
## 'data.frame':    2000 obs. of  54 variables:
##  $ target        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_01     : int  5 0 0 0 5 4 1 1 5 2 ...
##  $ ps_ind_02_cat : num  4 4 1 1 1 1 1 1 2 1 ...
##  $ ps_ind_03     : int  10 7 2 10 5 8 4 7 3 6 ...
##  $ ps_ind_04_cat : num  2 1 2 2 1 1 1 1 2 1 ...
##  $ ps_ind_05_cat : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ ps_ind_06_bin : int  0 0 1 0 0 1 1 1 0 0 ...
##  $ ps_ind_07_bin : int  0 0 0 1 1 0 0 0 1 1 ...
##  $ ps_ind_08_bin : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_09_bin : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_10_bin : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_11_bin : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_12_bin : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ ps_ind_13_bin : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_14     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ ps_ind_15     : int  8 9 3 2 9 10 12 6 7 0 ...
##  $ ps_ind_16_bin : int  1 0 0 0 1 1 1 1 1 0 ...
##  $ ps_ind_17_bin : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ ps_ind_18_bin : int  0 0 1 1 0 0 0 0 0 1 ...
##  $ ps_reg_01     : num  0.9 0.9 0.4 0.6 0.9 0.9 0.3 0.9 0.9 0.7 ...
##  $ ps_reg_02     : num  1.4 1.2 0 0.7 0.3 0.4 0 0.4 1.6 1.2 ...
##  $ ps_car_01_cat : num  4 4 9 4 6 9 3 1 9 4 ...
##  $ ps_car_02_cat : num  1 1 2 1 1 2 2 2 1 1 ...
##  $ ps_car_04_cat : num  1 1 1 1 1 1 1 1 1 9 ...
##  $ ps_car_06_cat : num  16 15 12 15 2 12 2 12 2 15 ...
##  $ ps_car_07_cat : num  2 2 2 2 2 2 2 2 1 2 ...
##  $ ps_car_08_cat : num  2 2 2 2 2 2 2 2 2 1 ...
##  $ ps_car_09_cat : num  3 3 1 3 3 1 1 1 1 3 ...
##  $ ps_car_10_cat : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ ps_car_11     : int  2 3 2 3 2 3 2 2 3 3 ...
##  $ ps_car_12     : num  0.424 0.447 0.374 0.447 0.4 ...
##  $ ps_car_13     : num  0.754 0.828 0.627 0.895 0.923 ...
##  $ ps_car_14     : num  0.342 0.361 0.367 0.336 0.408 ...
##  $ ps_car_15     : num  2 2.83 2.65 3 3.61 ...
##  $ ps_calc_01    : num  0.1 0 0.1 0.7 0.6 0 0.3 0.3 0 0.5 ...
##  $ ps_calc_02    : num  0.3 0.5 0.9 0.1 0.1 0 0 0.5 0.3 0 ...
##  $ ps_calc_03    : num  0.5 0.8 0.6 0.1 0.6 0.8 0.7 0.6 0.9 0.4 ...
##  $ ps_calc_04    : int  2 0 5 1 3 4 3 2 3 1 ...
##  $ ps_calc_05    : int  3 1 0 2 2 1 0 2 1 2 ...
##  $ ps_calc_06    : int  8 8 9 8 6 7 8 10 8 6 ...
##  $ ps_calc_07    : int  1 3 5 4 4 1 4 1 3 5 ...
##  $ ps_calc_08    : int  9 8 10 8 9 9 11 11 9 8 ...
##  $ ps_calc_09    : int  4 3 2 1 2 3 3 2 2 2 ...
##  $ ps_calc_10    : int  8 7 10 9 9 4 10 9 8 8 ...
##  $ ps_calc_11    : int  2 6 5 7 3 4 12 7 3 8 ...
##  $ ps_calc_12    : int  3 3 0 1 2 2 1 2 1 0 ...
##  $ ps_calc_13    : int  6 3 5 4 3 2 0 1 1 5 ...
##  $ ps_calc_14    : int  6 11 5 5 9 13 9 7 10 11 ...
##  $ ps_calc_15_bin: int  0 1 0 0 0 0 0 0 0 0 ...
##  $ ps_calc_16_bin: int  0 1 1 1 1 1 1 1 0 0 ...
##  $ ps_calc_17_bin: int  1 0 0 1 1 0 0 1 1 1 ...
##  $ ps_calc_18_bin: int  0 0 1 0 1 0 0 0 0 0 ...
##  $ ps_calc_19_bin: int  0 0 0 0 1 0 0 0 1 0 ...
##  $ ps_calc_20_bin: int  0 0 0 0 0 0 0 0 1 1 ...

3.5 Convert the data types and separate the test data and train data.

As shown above, the mode imputation coerced some variables to type chr, so we need to fix the data types, and we also prepare two versions of the data set for the methods we will apply later.

  • pddata - all the binary and categorical variables set as factor.

  • pddata0 - all variables num or int.

We also divide our data set into train and test sets, with 70% and 30% of the observations respectively.

factorvariable<-function(data){
  data <- data %>%
    mutate_at(vars(ends_with("cat")), funs(factor)) %>%
    mutate_at(vars(ends_with("bin")), funs(factor))
  data <- data %>%
    mutate_at((split(names(data),sapply(data, function(x) paste(class(x), collapse=" ")))$integer), funs(as.ordered))
  return(data)
}
pddata <- factorvariable(pddata0)  # factorvariable() also converts the integer target to a factor

str(pddata)
## 'data.frame':    2000 obs. of  54 variables:
##  $ target        : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ps_ind_01     : Ord.factor w/ 8 levels "0"<"1"<"2"<"3"<..: 6 1 1 1 6 5 2 2 6 3 ...
##  $ ps_ind_02_cat : Factor w/ 4 levels "1","2","3","4": 4 4 1 1 1 1 1 1 2 1 ...
##  $ ps_ind_03     : Ord.factor w/ 12 levels "0"<"1"<"2"<"3"<..: 11 8 3 11 6 9 5 8 4 7 ...
##  $ ps_ind_04_cat : Factor w/ 2 levels "1","2": 2 1 2 2 1 1 1 1 2 1 ...
##  $ ps_ind_05_cat : Factor w/ 7 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ps_ind_06_bin : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 2 2 1 1 ...
##  $ ps_ind_07_bin : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 1 2 2 ...
##  $ ps_ind_08_bin : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
##  $ ps_ind_09_bin : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ps_ind_10_bin : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ps_ind_11_bin : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ps_ind_12_bin : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ ps_ind_13_bin : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ps_ind_14     : Ord.factor w/ 3 levels "0"<"1"<"2": 1 1 1 1 1 1 1 1 1 2 ...
##  $ ps_ind_15     : Ord.factor w/ 14 levels "0"<"1"<"2"<"3"<..: 9 10 4 3 10 11 13 7 8 1 ...
##  $ ps_ind_16_bin : Factor w/ 2 levels "0","1": 2 1 1 1 2 2 2 2 2 1 ...
##  $ ps_ind_17_bin : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ps_ind_18_bin : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 2 ...
##  $ ps_reg_01     : num  0.9 0.9 0.4 0.6 0.9 0.9 0.3 0.9 0.9 0.7 ...
##  $ ps_reg_02     : num  1.4 1.2 0 0.7 0.3 0.4 0 0.4 1.6 1.2 ...
##  $ ps_car_01_cat : Factor w/ 12 levels "1","2","3","4",..: 4 4 9 4 6 9 3 1 9 4 ...
##  $ ps_car_02_cat : Factor w/ 2 levels "1","2": 1 1 2 1 1 2 2 2 1 1 ...
##  $ ps_car_04_cat : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 9 ...
##  $ ps_car_06_cat : Factor w/ 18 levels "1","2","3","4",..: 16 15 12 15 2 12 2 12 2 15 ...
##  $ ps_car_07_cat : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 1 2 ...
##  $ ps_car_08_cat : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 1 ...
##  $ ps_car_09_cat : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 3 3 1 1 1 1 3 ...
##  $ ps_car_10_cat : Factor w/ 3 levels "1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ps_car_11     : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 3 4 3 4 3 4 3 3 4 4 ...
##  $ ps_car_12     : num  0.424 0.447 0.374 0.447 0.4 ...
##  $ ps_car_13     : num  0.754 0.828 0.627 0.895 0.923 ...
##  $ ps_car_14     : num  0.342 0.361 0.367 0.336 0.408 ...
##  $ ps_car_15     : num  2 2.83 2.65 3 3.61 ...
##  $ ps_calc_01    : num  0.1 0 0.1 0.7 0.6 0 0.3 0.3 0 0.5 ...
##  $ ps_calc_02    : num  0.3 0.5 0.9 0.1 0.1 0 0 0.5 0.3 0 ...
##  $ ps_calc_03    : num  0.5 0.8 0.6 0.1 0.6 0.8 0.7 0.6 0.9 0.4 ...
##  $ ps_calc_04    : Ord.factor w/ 6 levels "0"<"1"<"2"<"3"<..: 3 1 6 2 4 5 4 3 4 2 ...
##  $ ps_calc_05    : Ord.factor w/ 7 levels "0"<"1"<"2"<"3"<..: 4 2 1 3 3 2 1 3 2 3 ...
##  $ ps_calc_06    : Ord.factor w/ 8 levels "3"<"4"<"5"<"6"<..: 6 6 7 6 4 5 6 8 6 4 ...
##  $ ps_calc_07    : Ord.factor w/ 9 levels "0"<"1"<"2"<"3"<..: 2 4 6 5 5 2 5 2 4 6 ...
##  $ ps_calc_08    : Ord.factor w/ 9 levels "4"<"5"<"6"<"7"<..: 6 5 7 5 6 6 8 8 6 5 ...
##  $ ps_calc_09    : Ord.factor w/ 8 levels "0"<"1"<"2"<"3"<..: 5 4 3 2 3 4 4 3 3 3 ...
##  $ ps_calc_10    : Ord.factor w/ 20 levels "0"<"1"<"2"<"3"<..: 9 8 11 10 10 5 11 10 9 9 ...
##  $ ps_calc_11    : Ord.factor w/ 16 levels "0"<"1"<"2"<"3"<..: 3 7 6 8 4 5 13 8 4 9 ...
##  $ ps_calc_12    : Ord.factor w/ 8 levels "0"<"1"<"2"<"3"<..: 4 4 1 2 3 3 2 3 2 1 ...
##  $ ps_calc_13    : Ord.factor w/ 11 levels "0"<"1"<"2"<"3"<..: 7 4 6 5 4 3 1 2 2 6 ...
##  $ ps_calc_14    : Ord.factor w/ 19 levels "1"<"2"<"3"<"4"<..: 6 11 5 5 9 13 9 7 10 11 ...
##  $ ps_calc_15_bin: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ps_calc_16_bin: Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 1 1 ...
##  $ ps_calc_17_bin: Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 2 2 2 ...
##  $ ps_calc_18_bin: Factor w/ 2 levels "0","1": 1 1 2 1 2 1 1 1 1 1 ...
##  $ ps_calc_19_bin: Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 2 1 ...
##  $ ps_calc_20_bin: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 2 ...
set.seed(123)
sample_index <- sample(nrow(pddata), replace = FALSE, size = 0.7*nrow(pddata))
train <- pddata[sample_index,]
test<-pddata[-sample_index,]
train0 <- pddata0[sample_index,]
test0<-pddata0[-sample_index,]

3.6 Data Visualization after Balanced Sampling

ggplot(pddata0, aes(target)) + geom_bar(fill = "lightblue") + ggtitle("Distribution of Target") + theme(plot.title = element_text(hjust = 0.5))

#ps_ind(continuous)
p1 <- ggplot(pddata0, aes(ps_ind_01)) + geom_histogram(binwidth = 1, fill = "lightblue") 
p2 <- ggplot(pddata0, aes(ps_ind_03)) + geom_histogram(binwidth = 1, fill = "lightblue") 
p3 <- ggplot(pddata0, aes(ps_ind_14)) + geom_histogram(binwidth = 1, fill = "lightblue") 
p4 <- ggplot(pddata0, aes(ps_ind_15)) + geom_histogram(binwidth = 1, fill = "lightblue") 

#ps_ind(category)
p5 <- ggplot(pddata0, aes(ps_ind_02_cat)) + geom_bar(fill = "lightblue")
p6 <- ggplot(pddata0, aes(ps_ind_04_cat)) + geom_bar(fill = "lightblue")
p7 <- ggplot(pddata0, aes(ps_ind_05_cat)) + geom_bar(fill = "lightblue")

#ps_ind(binary)
p8 <- ggplot(pddata0, aes(ps_ind_06_bin)) + geom_bar(fill = "lightblue")
p9 <- ggplot(pddata0, aes(ps_ind_07_bin)) + geom_bar(fill = "lightblue")
p10 <- ggplot(pddata0, aes(ps_ind_08_bin)) + geom_bar(fill = "lightblue")
p11 <- ggplot(pddata0, aes(ps_ind_09_bin)) + geom_bar(fill = "lightblue")
p12 <- ggplot(pddata0, aes(ps_ind_10_bin)) + geom_bar(fill = "lightblue")
p13 <- ggplot(pddata0, aes(ps_ind_11_bin)) + geom_bar(fill = "lightblue")
p14 <- ggplot(pddata0, aes(ps_ind_12_bin)) + geom_bar(fill = "lightblue")
p15 <- ggplot(pddata0, aes(ps_ind_13_bin)) + geom_bar(fill = "lightblue")
p16 <- ggplot(pddata0, aes(ps_ind_16_bin)) + geom_bar(fill = "lightblue")
p17 <- ggplot(pddata0, aes(ps_ind_17_bin)) + geom_bar(fill = "lightblue")
p18 <- ggplot(pddata0, aes(ps_ind_18_bin)) + geom_bar(fill = "lightblue")
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, nrow = 5)

#ps_reg
p1 <- ggplot(pddata0, aes(ps_reg_01)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p2 <- ggplot(pddata0, aes(ps_reg_02)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
grid.arrange(p1, p2, nrow = 1)

#ps_car(continuous)
p1 <- ggplot(pddata0, aes(ps_car_11)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p2 <- ggplot(pddata0, aes(ps_car_12)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p3 <- ggplot(pddata0, aes(ps_car_13)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p4 <- ggplot(pddata0, aes(ps_car_14)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p5 <- ggplot(pddata0, aes(ps_car_15)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 

#ps_car(category)
p6 <- ggplot(pddata0, aes(ps_car_01_cat)) + geom_bar(fill = "lightblue")
p7 <- ggplot(pddata0, aes(ps_car_02_cat)) + geom_bar(fill = "lightblue")
p9 <- ggplot(pddata0, aes(ps_car_04_cat)) + geom_bar(fill = "lightblue")
p11 <- ggplot(pddata0, aes(ps_car_06_cat)) + geom_bar(fill = "lightblue")
p12 <- ggplot(pddata0, aes(ps_car_07_cat)) + geom_bar(fill = "lightblue")
p13 <- ggplot(pddata0, aes(ps_car_08_cat)) + geom_bar(fill = "lightblue")
p14 <- ggplot(pddata0, aes(ps_car_09_cat)) + geom_bar(fill = "lightblue")
p15 <- ggplot(pddata0, aes(ps_car_10_cat)) + geom_bar(fill = "lightblue")
grid.arrange(p1, p2, p3, p4, p5,p6, p7, p9, p11, p12, p13, p14, p15, nrow = 4)

#ps_calc(continuous)
p1 <- ggplot(pddata0, aes(ps_calc_01)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p2 <- ggplot(pddata0, aes(ps_calc_02)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p3 <- ggplot(pddata0, aes(ps_calc_03)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p4 <- ggplot(pddata0, aes(ps_calc_04)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p5 <- ggplot(pddata0, aes(ps_calc_05)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p6 <- ggplot(pddata0, aes(ps_calc_06)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p7 <- ggplot(pddata0, aes(ps_calc_07)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p8 <- ggplot(pddata0, aes(ps_calc_08)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p9 <- ggplot(pddata0, aes(ps_calc_09)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p10 <- ggplot(pddata0, aes(ps_calc_10)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p11 <- ggplot(pddata0, aes(ps_calc_11)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p12 <- ggplot(pddata0, aes(ps_calc_12)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p13 <- ggplot(pddata0, aes(ps_calc_13)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 
p14 <- ggplot(pddata0, aes(ps_calc_14)) + geom_histogram(binwidth = 0.5, fill = "lightblue") 

#ps_calc(binary)
p15 <- ggplot(pddata0, aes(ps_calc_15_bin)) + geom_bar(fill = "lightblue")
p16 <- ggplot(pddata0, aes(ps_calc_16_bin)) + geom_bar(fill = "lightblue")
p17 <- ggplot(pddata0, aes(ps_calc_17_bin)) + geom_bar(fill = "lightblue")
p18 <- ggplot(pddata0, aes(ps_calc_18_bin)) + geom_bar(fill = "lightblue")
p19 <- ggplot(pddata0, aes(ps_calc_19_bin)) + geom_bar(fill = "lightblue")
p20 <- ggplot(pddata0, aes(ps_calc_20_bin)) + geom_bar(fill = "lightblue")
grid.arrange(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, nrow = 5)

4 Methods and Results

4.1 Logistic Regression

4.1.1 Correlation

Check the correlations between variables. corrplot is a good way to visualize the correlation matrix.

M <- cor(pddata0[1:36])  # the later variables have little correlation with the others
corrplot(M, method = "circle", title = "Correlation", tl.cex = 0.5, tl.col = 'black', mar=c(1, 1, 1, 1))

cor(ps$ps_ind_14, ps$ps_ind_12_bin)
## [1] 0.8901273
cor(ps$ps_ind_14, ps$ps_ind_11_bin)
## [1] 0.564903
cor(ps$ps_car_13, ps$ps_car_12)
## [1] 0.672014
cor(ps$ps_ind_18_bin, ps$ps_ind_16_bin)
## [1] -0.5942654

Because ps_ind_14 is strongly correlated with ps_ind_12_bin (r = 0.89), we decide to delete ps_ind_14 from the logistic model.

4.1.2 Train the Model

Fit a stepwise logistic regression: starting from an intercept-only model, step() adds and drops terms in both directions using AIC (k = 2) for selection.

base.model <- glm(target ~ 1, data = train_train, family = binomial(link = 'logit'))
all.model <- glm(target ~ . , data = train_train, family = binomial(link = 'logit'))
ols_step <- step(base.model, scope = list(lower = base.model, upper = all.model), direction = 'both', k=2, trace = F)
ols_step
## 
## Call:  glm(formula = target ~ ps_car_13 + ps_ind_05_cat + ps_ind_15 + 
##     ps_ind_17_bin + ps_ind_06_bin + ps_car_01_cat + ps_car_07_cat + 
##     ps_ind_09_bin + ps_ind_03 + ps_calc_02 + ps_calc_10, family = binomial(link = "logit"), 
##     data = train_train)
## 
## Coefficients:
##   (Intercept)      ps_car_13  ps_ind_05_cat      ps_ind_15  ps_ind_17_bin  
##      0.008193       0.881509       0.134945      -0.053724       0.485027  
## ps_ind_06_bin  ps_car_01_cat  ps_car_07_cat  ps_ind_09_bin      ps_ind_03  
##     -0.393706      -0.040550      -0.447874      -0.207529       0.024432  
##    ps_calc_02     ps_calc_10  
##     -0.214374      -0.019019  
## 
## Degrees of Freedom: 3499 Total (i.e. Null);  3488 Residual
## Null Deviance:       4318 
## Residual Deviance: 4124  AIC: 4148
summary(ols_step)
## 
## Call:
## glm(formula = target ~ ps_car_13 + ps_ind_05_cat + ps_ind_15 + 
##     ps_ind_17_bin + ps_ind_06_bin + ps_car_01_cat + ps_car_07_cat + 
##     ps_ind_09_bin + ps_ind_03 + ps_calc_02 + ps_calc_10, family = binomial(link = "logit"), 
##     data = train_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7316  -0.8625  -0.7077   1.2540   2.0253  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.008193   0.402808   0.020  0.98377    
## ps_car_13      0.881509   0.169698   5.195 2.05e-07 ***
## ps_ind_05_cat  0.134945   0.024764   5.449 5.06e-08 ***
## ps_ind_15     -0.053724   0.010653  -5.043 4.58e-07 ***
## ps_ind_17_bin  0.485027   0.105043   4.617 3.89e-06 ***
## ps_ind_06_bin -0.393706   0.090733  -4.339 1.43e-05 ***
## ps_car_01_cat -0.040550   0.012430  -3.262  0.00111 ** 
## ps_car_07_cat -0.447874   0.161448  -2.774  0.00554 ** 
## ps_ind_09_bin -0.207529   0.109204  -1.900  0.05738 .  
## ps_ind_03      0.024432   0.014255   1.714  0.08654 .  
## ps_calc_02    -0.214374   0.132249  -1.621  0.10502    
## ps_calc_10    -0.019019   0.012778  -1.488  0.13663    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4317.6  on 3499  degrees of freedom
## Residual deviance: 4123.9  on 3488  degrees of freedom
## AIC: 4147.9
## 
## Number of Fisher Scoring iterations: 4
log_preds <- predict(ols_step, newdata = train_test[,-1], type = "response")
log_preds <- data.frame(log_preds = log_preds)
log_preds <- round(log_preds)
table(log_preds)
## log_preds
##    0    1 
## 1401   99
log_pred_compare <- cbind(log_preds, train_test$target)

4.1.3 Evaluate the Performance

The Kappa statistic varies from 0 to 1, where:

  • 0 = agreement equivalent to chance.
  • 0.01 – 0.20 = slight agreement.
  • 0.21 – 0.40 = fair agreement.
  • 0.41 – 0.60 = moderate agreement.
  • 0.61 – 0.80 = substantial agreement.
  • 0.81 – 0.99 = near-perfect agreement.
  • 1 = perfect agreement.
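As a sanity check on the `confusionMatrix()` output below, Cohen's kappa can be computed by hand from the cell counts (a base-R sketch using the counts reported for this model):

```r
# Cohen's kappa from a 2x2 confusion matrix (counts from the logistic model below)
cm <- matrix(c(1001, 47, 400, 52), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed agreement (= accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
round(kappa, 4)   # 0.0902, matching the caret output
```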
confusionMatrix(as.factor(log_pred_compare$log_preds), as.factor(log_pred_compare$`train_test$target`))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1001  400
##          1   47   52
##                                           
##                Accuracy : 0.702           
##                  95% CI : (0.6781, 0.7251)
##     No Information Rate : 0.6987          
##     P-Value [Acc > NIR] : 0.4014          
##                                           
##                   Kappa : 0.0902          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9552          
##             Specificity : 0.1150          
##          Pos Pred Value : 0.7145          
##          Neg Pred Value : 0.5253          
##              Prevalence : 0.6987          
##          Detection Rate : 0.6673          
##    Detection Prevalence : 0.9340          
##       Balanced Accuracy : 0.5351          
##                                           
##        'Positive' Class : 0               
## 
logi_cor<-cor(log_pred_compare$log_preds, log_pred_compare$`train_test$target`);logi_cor
## [1] 0.1297273

The typical interpretation of the area under curve (AUC):

  • Outstanding: 0.9-1.0
  • Excellent/good: 0.8-0.9
  • Acceptable/fair: 0.7-0.8
  • Poor: 0.6-0.7
  • No discrimination: 0.5-0.6
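Note that because this ROC curve is built from hard 0/1 predictions rather than probabilities, it has a single interior point and its AUC reduces to the balanced accuracy, (sensitivity + specificity)/2. A base-R check against the numbers reported below:

```r
# With binary predictions, AUC = (TPR + TNR) / 2 = balanced accuracy
sens <- 1001 / (1001 + 47)   # TPR for the positive class "0"
spec <- 52 / (400 + 52)      # TNR
auc  <- (sens + spec) / 2
round(auc, 4)   # 0.5351, matching the ROCR AUC below
```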
pred<-ROCR::prediction(log_pred_compare$log_preds, log_pred_compare$`train_test$target`) 
roc_logi<-performance(pred,measure="tpr", x.measure="fpr")
plot(roc_logi, main="ROC curve for Logistic Regression Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)

roc_logi_auc<-performance(pred, measure="auc")
roc_logi_auc@y.values
## [[1]]
## [1] 0.5350985
cor(log_pred_compare$log_preds, log_pred_compare$`train_test$target`)
## [1] 0.1297273

4.1.4 Conclusion

As the results of the logistic regression show, the stepwise algorithm selects 11 variables for the model. But as the evaluation shows, this model is not robust: the accuracy, the kappa, and the area under the ROC curve all indicate poor performance.

Logistic regression may not be a good model for this case, but the variables it selects could still be used in other algorithms.

4.2 PCA

4.2.1 Fitting a PCA model

prin_comp <- prcomp(train0[,-1], scale. = T)

4.2.2 Visualization of PCs

fviz_pca_biplot(prin_comp, axes = c(1, 2), geom = "point",
  col.ind = "black", col.var = "steelblue", label = "all",
  invisible = "none", repel = F, habillage = train$target, 
  palette = NULL, addEllipses = TRUE, title = "PCA - Biplot")

#compute standard deviation of each principal component
std_dev <- prin_comp$sdev
#compute variance
pr_var <- std_dev^2
#check variance of first 10 components
pr_var[1:10]
##  [1] 3.796254 2.678176 2.054988 1.882825 1.827166 1.604345 1.482933
##  [8] 1.393821 1.304530 1.287704
#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]
##  [1] 0.07162743 0.05053162 0.03877335 0.03552500 0.03447484 0.03027065
##  [7] 0.02797987 0.02629852 0.02461378 0.02429631 0.02356565 0.02290595
## [13] 0.02200347 0.02139322 0.02090439 0.02086223 0.02061759 0.02040525
## [19] 0.01996466 0.01950848
plot(prop_varex, xlab = "Principal Component",
         ylab = "Proportion of Variance Explained",
         type = "b")

 plot(cumsum(prop_varex), xlab = "Principal Component",
        ylab = "Cumulative Proportion of Variance Explained",type="b")

The plots above show that roughly 40–50 components explain around 90% of the variance in the data set. In other words, using PCA we can reduce the 54 predictors to 40–50 components without losing much of the explained variance.
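The cutoff can also be chosen programmatically from the cumulative proportions; a minimal base-R sketch, using a short hypothetical variance vector in place of prop_varex:

```r
# Pick the smallest number of PCs whose cumulative proportion of
# variance reaches a target threshold (hypothetical example vector)
prop <- c(0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.05)
n_pc <- min(which(cumsum(prop) >= 0.90))
n_pc   # 6 components reach 90% in this toy example
```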

4.2.3 Build a PCA-C5.0 model

Build the C5.0 model on raw data for a sanity check

cmodel<-C5.0(train0[,-1],as.factor(train$target))
predrp <- predict(cmodel, test0[, -1], type="class")
target1<-as.factor(test0$target)
confusionMatrix(predrp, target1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 315 114
##          1 103  68
##                                           
##                Accuracy : 0.6383          
##                  95% CI : (0.5984, 0.6768)
##     No Information Rate : 0.6967          
##     P-Value [Acc > NIR] : 0.9991          
##                                           
##                   Kappa : 0.1294          
##  Mcnemar's Test P-Value : 0.4972          
##                                           
##             Sensitivity : 0.7536          
##             Specificity : 0.3736          
##          Pos Pred Value : 0.7343          
##          Neg Pred Value : 0.3977          
##              Prevalence : 0.6967          
##          Detection Rate : 0.5250          
##    Detection Prevalence : 0.7150          
##       Balanced Accuracy : 0.5636          
##                                           
##        'Positive' Class : 0               
## 
predrp<-ROCR::prediction(predictions=as.numeric(predrp), labels=test$target) 
rocrp<-performance(predrp,measure="tpr", x.measure="fpr")
plot(rocrp, main="ROC curve for C5.0 Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)

rocrp_auc<-performance(predrp, measure="auc")
rocrp_auc@y.values
## [[1]]
## [1] 0.5636074

Add a training data frame with the principal components

train.data <- data.frame(target = train0$target, prin_comp$x)
train.data <- train.data[,1:40]
test.zspace <- predict(prin_comp, newdata=test0[, -1])
pca.train.df <- as.data.frame(prin_comp$x)

Run a decision tree after transforming the test set into the PCA space

pca.model <- C5.0(pca.train.df,as.factor(train0$target))
pca.test.df <- as.data.frame(test.zspace)
pca.test.df$target <- test0[, 1]
pca.pred <- predict(pca.model, pca.test.df[,-54], type="class")
confusionMatrix(pca.pred, target1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 374 142
##          1  44  40
##                                           
##                Accuracy : 0.69            
##                  95% CI : (0.6513, 0.7268)
##     No Information Rate : 0.6967          
##     P-Value [Acc > NIR] : 0.6571          
##                                           
##                   Kappa : 0.135           
##  Mcnemar's Test P-Value : 1.141e-12       
##                                           
##             Sensitivity : 0.8947          
##             Specificity : 0.2198          
##          Pos Pred Value : 0.7248          
##          Neg Pred Value : 0.4762          
##              Prevalence : 0.6967          
##          Detection Rate : 0.6233          
##    Detection Prevalence : 0.8600          
##       Balanced Accuracy : 0.5573          
##                                           
##        'Positive' Class : 0               
## 
predpca<-ROCR::prediction(predictions=as.numeric(pca.pred), labels=test$target) 
rocpca<-performance(predpca,measure="tpr", x.measure="fpr")
plot(rocpca, main="ROC curve for PCA+C5.0 Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)

rocpca_auc<-performance(predpca, measure="auc")
rocpca_auc@y.values
## [[1]]
## [1] 0.5572585

4.2.4 Conclusion

For these data the scree plot has no clear “elbow”: most of the PCs explain roughly the same amount of variance, so it is hard to tell how many components or factors to keep. Moreover, the performance of the C5.0 model on the PC scores is not improved, and is even slightly worse. Thus, PCA is not well suited to this analysis.

4.3 RandomForest model

4.3.1 Build a Random Forest model

mt<-floor(sqrt(ncol(train)))
rf.fit <- randomForest(target~. , data=train,importance=TRUE,ntree=500,mtry=mt)
varImpPlot(rf.fit, cex=0.5); print(rf.fit)

## 
## Call:
##  randomForest(formula = target ~ ., data = train, importance = TRUE,      ntree = 500, mtry = mt) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 28.64%
## Confusion matrix:
##     0  1 class.error
## 0 935 42  0.04298874
## 1 359 64  0.84869976
plot(rf.fit,main="model")

The selection of mtry: \[mtry=\lfloor\sqrt{ncol(data)}\rfloor=7\]

4.3.2 Get the important features

var.imp <- data.frame(importance(rf.fit,type=2))
var.imp$Variables <- row.names(var.imp)
varord<-var.imp[order(var.imp$MeanDecreaseGini,decreasing = T),]
imprf<-varord$Variables[1:25]
imprf
##  [1] "ps_car_06_cat" "ps_car_13"     "ps_car_01_cat" "ps_car_14"    
##  [5] "ps_ind_15"     "ps_calc_03"    "ps_reg_02"     "ps_calc_14"   
##  [9] "ps_ind_03"     "ps_calc_10"    "ps_calc_01"    "ps_calc_02"   
## [13] "ps_calc_11"    "ps_car_15"     "ps_reg_01"     "ps_calc_13"   
## [17] "ps_car_12"     "ps_calc_07"    "ps_ind_01"     "ps_calc_08"   
## [21] "ps_calc_06"    "ps_calc_05"    "ps_calc_09"    "ps_ind_05_cat"
## [25] "ps_calc_04"
predrf1 <- predict(rf.fit,test,type="prob")
summary(predrf1)
##        0                1         
##  Min.   :0.2300   Min.   :0.0520  
##  1st Qu.:0.6120   1st Qu.:0.2455  
##  Median :0.6900   Median :0.3100  
##  Mean   :0.6768   Mean   :0.3232  
##  3rd Qu.:0.7545   3rd Qu.:0.3880  
##  Max.   :0.9480   Max.   :0.7700
predrf2<-predict(rf.fit,test)
confusionMatrix(predrf2,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 396 161
##          1  22  21
##                                           
##                Accuracy : 0.695           
##                  95% CI : (0.6564, 0.7316)
##     No Information Rate : 0.6967          
##     P-Value [Acc > NIR] : 0.5552          
##                                           
##                   Kappa : 0.08            
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9474          
##             Specificity : 0.1154          
##          Pos Pred Value : 0.7110          
##          Neg Pred Value : 0.4884          
##              Prevalence : 0.6967          
##          Detection Rate : 0.6600          
##    Detection Prevalence : 0.9283          
##       Balanced Accuracy : 0.5314          
##                                           
##        'Positive' Class : 0               
## 
pred<-ROCR::prediction(predictions=predrf1[,2], labels=test$target) 
roc<-performance(pred,measure="tpr", x.measure="fpr")
plot(roc, main="ROC curve for Random Forest Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)

roc_auc<-performance(pred, measure="auc")
roc_auc@y.values
## [[1]]
## [1] 0.6478061

4.3.3 Conclusion

As expected, the Random Forest model performs much better than a single decision tree, since Random Forest is an ensemble classifier that aggregates many decision trees to make its prediction.

4.4 Neural Networks

4.4.1 Build a NN model

fmla <- as.formula(paste("target ~ ", paste(colnames(train0[,-1]), collapse= "+")))

NN_model<-neuralnet(fmla, data=as.matrix(train0), hidden = 5,stepmax = 1e6)

NN_pred<-compute(NN_model, test0[, -1])
NN_pred_results<-ifelse(NN_pred$net.result>0.5,1,0)
confusionMatrix(as.factor(NN_pred_results),as.factor(test0$target))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 376 152
##          1  42  30
##                                                  
##                Accuracy : 0.6766667              
##                  95% CI : (0.637605, 0.7139784)  
##     No Information Rate : 0.6966667              
##     P-Value [Acc > NIR] : 0.8662546              
##                                                  
##                   Kappa : 0.077596               
##  Mcnemar's Test P-Value : 0.000000000000005046636
##                                                  
##             Sensitivity : 0.8995215              
##             Specificity : 0.1648352              
##          Pos Pred Value : 0.7121212              
##          Neg Pred Value : 0.4166667              
##              Prevalence : 0.6966667              
##          Detection Rate : 0.6266667              
##    Detection Prevalence : 0.8800000              
##       Balanced Accuracy : 0.5321783              
##                                                  
##        'Positive' Class : 0                      
## 
prednn<-ROCR::prediction(predictions=NN_pred_results, labels=target1) 
rocnn<-performance(prednn,measure="tpr", x.measure="fpr")
plot(rocnn, main="ROC curve for Neural Network Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)

rocnn_auc<-performance(prednn, measure="auc")
rocnn_auc@y.values
## [[1]]
## [1] 0.532178348

4.4.2 Conclusion

The Neural Network was expected to perform better than the previous models. However, building the model is so time-consuming that tuning its parameters is difficult.

4.5 XGBoost

XGBoost is short for “Extreme Gradient Boosting”. XGBoost is used for supervised learning problems, where we use the training data (with multiple features) xi to predict a target variable yi. Its objective contains two parts: a training loss and a regularization term. The underlying model is a tree ensemble: a set of classification and regression trees (CART). In CART, a real-valued score is associated with each leaf, which gives us richer interpretations than pure classification.
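The two parts of the objective can be written, following the notation of the XGBoost documentation, as \[\text{obj}(\theta)=\sum_{i} l(y_i,\hat{y}_i)+\sum_{k}\Omega(f_k), \qquad \Omega(f)=\gamma T+\frac{1}{2}\lambda\lVert w\rVert^2\] where \(l\) is the training loss, the \(f_k\) are the individual trees, \(T\) is the number of leaves in a tree, and \(w\) is the vector of leaf scores.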

4.5.1 Based on all the variables

4.5.1.1 Train the Model

labels = train_train['target']
y <- labels$target

bstSparse <- xgboost(data = data.matrix(train_train[,-1]), label = y, max.depth = 30, eta = 0.05, nthread = 3, nround = 30, objective = "binary:logistic")
## [1]  train-error:0.034051 
## [2]  train-error:0.024060 
## [3]  train-error:0.016566 
## [4]  train-error:0.013366 
## [5]  train-error:0.009706 
## [6]  train-error:0.007871 
## [7]  train-error:0.006891 
## [8]  train-error:0.006149 
## [9]  train-error:0.005503 
## [10] train-error:0.005157 
## [11] train-error:0.004663 
## [12] train-error:0.004180 
## [13] train-error:0.003954 
## [14] train-error:0.003671 
## [15] train-error:0.003391 
## [16] train-error:0.003200 
## [17] train-error:0.003009 
## [18] train-error:0.002783 
## [19] train-error:0.002649 
## [20] train-error:0.002403 
## [21] train-error:0.002214 
## [22] train-error:0.002074 
## [23] train-error:0.002000 
## [24] train-error:0.001923 
## [25] train-error:0.001814 
## [26] train-error:0.001660 
## [27] train-error:0.001571 
## [28] train-error:0.001523 
## [29] train-error:0.001403 
## [30] train-error:0.001351
xgb_pred <- predict(bstSparse, data.matrix(train_test[,-1]))
xgb_pred <- data.frame(xgb_pred)
xgb_pred <- round(xgb_pred)
xgb_pred_compare <- cbind(xgb_pred, train_test$target)

4.5.1.2 Evaluate the Performance

confusionMatrix(as.factor(xgb_pred_compare$xgb_pred), as.factor(xgb_pred_compare$`train_test$target`))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 104446   1207
##          1    729  43618
##                                                   
##                Accuracy : 0.9870933               
##                  95% CI : (0.9865093, 0.9876584)  
##     No Information Rate : 0.7011667               
##     P-Value [Acc > NIR] : < 0.00000000000000022204
##                                                   
##                   Kappa : 0.9691067               
##  Mcnemar's Test P-Value : < 0.00000000000000022204
##                                                   
##             Sensitivity : 0.9930687               
##             Specificity : 0.9730731               
##          Pos Pred Value : 0.9885758               
##          Neg Pred Value : 0.9835615               
##              Prevalence : 0.7011667               
##          Detection Rate : 0.6963067               
##    Detection Prevalence : 0.7043533               
##       Balanced Accuracy : 0.9830709               
##                                                   
##        'Positive' Class : 0                       
## 
pred_xgb<-ROCR::prediction(xgb_pred_compare$xgb_pred, xgb_pred_compare$`train_test$target`) 
roc_xgb<-performance(pred_xgb, measure="tpr", x.measure="fpr")
plot(roc_xgb, main="ROC curve for XGBoost Model", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)

roc_xgb_auc<-performance(pred_xgb, measure="auc")
roc_xgb_auc@y.values
## [[1]]
## [1] 0.9830708785
cor(xgb_pred_compare$xgb_pred, xgb_pred_compare$`train_test$target`)
## [1] 0.9691348761

4.5.1.3 Conclusion

It is obvious that the XGBoost model works almost perfectly for this case. Some reasons for its good performance: XGBoost runs much faster than the other methods, so we could apply the largest data set here, which trains the model better than the others. XGBoost is also sparsity-aware, and boosted trees work especially well on categorical features, of which our data set has many. The reason XGBoost runs so fast is that its sparse data structure and clever implementation allow it to sort columns independently, so the sorting work can be divided among parallel CPU threads.

4.5.2 Based on Selected Variables from previous methods

4.5.2.1 Variables Selection

imprf  #selected variables from RandomForest
##  [1] "ps_car_06_cat" "ps_car_13"     "ps_car_01_cat" "ps_car_14"    
##  [5] "ps_ind_15"     "ps_calc_03"    "ps_reg_02"     "ps_calc_14"   
##  [9] "ps_ind_03"     "ps_calc_10"    "ps_calc_01"    "ps_calc_02"   
## [13] "ps_calc_11"    "ps_car_15"     "ps_reg_01"     "ps_calc_13"   
## [17] "ps_car_12"     "ps_calc_07"    "ps_ind_01"     "ps_calc_08"   
## [21] "ps_calc_06"    "ps_calc_05"    "ps_calc_09"    "ps_ind_05_cat"
## [25] "ps_calc_04"
ols_step$coefficients  # from logistic regression
##     (Intercept)       ps_car_13   ps_ind_05_cat       ps_ind_15 
##  0.008192848643  0.881508871955  0.134944934632 -0.053723681378 
##   ps_ind_17_bin   ps_ind_06_bin   ps_car_01_cat   ps_car_07_cat 
##  0.485026816911 -0.393705632853 -0.040549611039 -0.447873929583 
##   ps_ind_09_bin       ps_ind_03      ps_calc_02      ps_calc_10 
## -0.207528910331  0.024432051182 -0.214374023265 -0.019019396169
#Using the union of the two methods
sele_variable <- c("target", "ps_car_06_cat","ps_car_13","ps_car_01_cat", "ps_car_14","ps_ind_15","ps_calc_10", "ps_ind_03", "ps_reg_02","ps_calc_14","ps_calc_11","ps_calc_03","ps_calc_01","ps_calc_02","ps_calc_08","ps_calc_13","ps_calc_07" ,"ps_car_15", "ps_ind_05_cat","ps_reg_01","ps_calc_06","ps_ind_01","ps_car_12","ps_calc_05","ps_calc_09", "ps_ind_17_bin","ps_ind_15", "ps_car_07_cat")
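Rather than typing the union out by hand (note that the manual vector above repeats ps_ind_15), the two variable sets can be combined programmatically with `union()`; a sketch with short illustrative vectors standing in for imprf and the stepwise coefficient names:

```r
# Combine the two variable sets without duplicates (illustrative vectors;
# in the analysis these would be imprf and names(ols_step$coefficients))
rf_vars  <- c("ps_car_13", "ps_ind_15", "ps_calc_10")
log_vars <- c("ps_car_13", "ps_ind_17_bin", "ps_calc_10")
sele <- c("target", union(rf_vars, log_vars))
sele
# "target" "ps_car_13" "ps_ind_15" "ps_calc_10" "ps_ind_17_bin"
```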

4.5.2.2 Train the Model

train_train_se <- train_train[,sele_variable]
train_test_se <- train_test[,sele_variable]
  
labels = train_train_se['target']
y <- labels$target

bstSparse_se <- xgboost(data = data.matrix(train_train_se[,-1]), label = y, max.depth = 30, eta = 0.05, nthread = 3, nround = 30, objective = "binary:logistic")
## [1]  train-error:0.037260 
## [2]  train-error:0.024817 
## [3]  train-error:0.021260 
## [4]  train-error:0.015080 
## [5]  train-error:0.013000 
## [6]  train-error:0.011526 
## [7]  train-error:0.010617 
## [8]  train-error:0.009423 
## [9]  train-error:0.008366 
## [10] train-error:0.007534 
## [11] train-error:0.006966 
## [12] train-error:0.006314 
## [13] train-error:0.005677 
## [14] train-error:0.005100 
## [15] train-error:0.004640 
## [16] train-error:0.004183 
## [17] train-error:0.003903 
## [18] train-error:0.003529 
## [19] train-error:0.003211 
## [20] train-error:0.003000 
## [21] train-error:0.002800 
## [22] train-error:0.002566 
## [23] train-error:0.002443 
## [24] train-error:0.002243 
## [25] train-error:0.002137 
## [26] train-error:0.002023 
## [27] train-error:0.001929 
## [28] train-error:0.001840 
## [29] train-error:0.001763 
## [30] train-error:0.001674
xgb_se_pred <- predict(bstSparse_se, data.matrix(train_test_se[,-1]))
xgb_se_pred <- data.frame(xgb_se_pred)
xgb_se_pred <- round(xgb_se_pred)
xgb_se_pred_compare <- cbind(xgb_se_pred, train_test_se$target)

4.5.2.3 Evaluate the Performance

confusionMatrix(as.factor(xgb_se_pred_compare$xgb_se_pred), as.factor(xgb_se_pred_compare$`train_test_se$target`))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 104113   1340
##          1   1062  43485
##                                                   
##                Accuracy : 0.9839867               
##                  95% CI : (0.9833387, 0.9846159)  
##     No Information Rate : 0.7011667               
##     P-Value [Acc > NIR] : < 0.00000000000000022204
##                                                   
##                   Kappa : 0.9617197               
##  Mcnemar's Test P-Value : 0.00000001586983        
##                                                   
##             Sensitivity : 0.9899025               
##             Specificity : 0.9701060               
##          Pos Pred Value : 0.9872929               
##          Neg Pred Value : 0.9761600               
##              Prevalence : 0.7011667               
##          Detection Rate : 0.6940867               
##    Detection Prevalence : 0.7030200               
##       Balanced Accuracy : 0.9800043               
##                                                   
##        'Positive' Class : 0                       
## 
roc_se_xgb_cor <- cor(xgb_se_pred_compare$xgb_se_pred, xgb_se_pred_compare$`train_test_se$target`)

pred_se<-ROCR::prediction(xgb_se_pred_compare$xgb_se_pred, xgb_se_pred_compare$`train_test_se$target`) 
roc_se_xgb<-performance(pred_se,measure="tpr", x.measure="fpr")
plot(roc_se_xgb, main="ROC curve for XGBoost Model(Selected Variables)", col="blue", lwd=3)
segments(0, 0, 1, 1, lty=2)

roc_se_xgb_auc<-performance(pred_se, measure="auc")
roc_se_xgb_auc@y.values
## [[1]]
## [1] 0.9800042555

4.5.2.4 Conclusion

This model does not perform better than the previous one, possibly because the variables were selected by logistic regression and Random Forest, neither of which performed very well here. We may therefore doubt that the union of the logistic-regression variables and the top 25 variables ranked by Random Forest is reliable enough to fully predict the target.

5 Discussion and Conclusion

5.1 Compare the confusion matrix

Method               Accuracy   Kappa    Sensitivity   Specificity
Logistic Regression  0.7023     0.0491   0.9788        0.0578
PCA+C5.0             0.6883     0.0615   0.9347        0.1138
Random Forest        0.7533     0.2539   0.9838        0.2160
Neural Networks      0.7000     0.0064   0.9973        0.0072
XGBoost              0.9934     0.9842   0.9968        0.9853

5.2 Compare ROC

In these ROC plots, most of the curves form a single angular “elbow”, because both the true values and the predictions are binary. The Random Forest curve is the exception: its output is a probability rather than a binary label, so its curve is not an angular elbow.

Method               ROC area
Logistic Regression  0.5241
PCA+C5.0             0.5242
Random Forest        0.6869
Neural Networks      0.5022
XGBoost              0.9911

5.3 Conclusion

From the evaluation tables above, we can conclude that XGBoost performs best in this case. Here are some reasons why the other algorithms perform poorly:

  • Logistic Regression performs poorly when there are non-linear relationships; the model is not naturally flexible enough to capture more complex patterns. And since we have no prior knowledge of the masked variables, we cannot screen out variables by hand.

  • For Neural Networks, the poor performance comes from our only being able to fit a model with one hidden layer and a limited number of neurons, so the power of the model is not fully exploited. This deep learning algorithm requires a very large amount of data and much more expertise to tune, and is computationally intensive to train.

  • For the C5.0 decision tree model, unconstrained individual trees are prone to overfitting. This weakness can be alleviated by using ensembles, which is why Random Forest performs well.

  • For PCA, we found that the principal components of these data are not useful, which may be because the features are not close to Gaussian.

As a future extension, we hope the full data set (6 million records) can be used to train the models, rather than being constrained by the computers’ processing speed. And for the neural network model, more layers and neurons could be applied with careful tuning.

6 Acknowledgements

I gratefully acknowledge the help of Dr. Ivo Dinov throughout this semester, as well as his expert advice and encouragement on our final project. It has been a pleasure to be a student in this class; I really enjoyed the course, and it was always a delight to chat with you about professional problems.

7 References